6 research outputs found

    Paraphrase Generation and Evaluation on Colloquial-Style Sentences

    This paper presents FISKMÖ, a project that focuses on the development of resources and tools for cross-linguistic research and machine translation between Finnish and Swedish. The goal of the project is to compile a massive parallel corpus from translated material collected from web sources, public and private organisations, and language service providers in Finland, a country with two official languages. The project also aims to develop open and freely accessible translation services between these two languages, both for general-purpose and for domain-specific use. We have released new data sets with over 3 million translation units, a benchmark test set for MT development, pre-trained neural MT models with high coverage and competitive performance, and a self-contained MT plugin for a popular CAT tool. The latter enables offline translation without dependencies on external services, making it possible to work with highly sensitive data without compromising security.

    In this paper, we investigate paraphrase generation in the colloquial domain. We use state-of-the-art neural machine translation models trained on the Opusparcus corpus to generate paraphrases in six languages: German, English, Finnish, French, Russian, and Swedish. We perform experiments to understand how data selection and filtering for diverse paraphrase pairs affect the generated paraphrases. We compare two model architectures, an RNN and a Transformer, and find that the Transformer does not generally outperform the RNN. We also conduct human evaluation on five of the six languages and compare the results to the automatic evaluation metrics BLEU and the recently proposed BERTScore. The results advance our understanding of the trade-offs between the quality and novelty of generated paraphrases as affected by the data selection method. In addition, our comparison of the evaluation methods shows that while BLEU correlates well with human judgments at the corpus level, BERTScore outperforms BLEU in both corpus-level and sentence-level evaluation. Peer reviewed.
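
    The corpus- and sentence-level metric comparison described in the abstract can be illustrated with off-the-shelf tooling. The following sketch is not the authors' evaluation code: it assumes the sacrebleu and bert-score Python packages and uses made-up candidate/reference pairs, merely showing how BLEU (corpus-level) and BERTScore (per-sentence and averaged) are computed.

    # Minimal sketch: scoring generated paraphrases against references with
    # BLEU (sacrebleu) and BERTScore (bert-score). The packages and example
    # sentences are assumptions for illustration, not taken from Opusparcus.
    import sacrebleu
    from bert_score import score

    candidates = ["i am really sorry about that", "let us get out of here"]
    references = ["i am so sorry about it", "let's leave this place"]

    # Corpus-level BLEU over all candidate/reference pairs.
    bleu = sacrebleu.corpus_bleu(candidates, [references])
    print(f"corpus BLEU: {bleu.score:.2f}")

    # BERTScore returns per-sentence precision/recall/F1, so it supports
    # both sentence-level and corpus-level (averaged) evaluation.
    P, R, F1 = score(candidates, references, lang="en", verbose=False)
    print("sentence-level BERTScore F1:", [round(f.item(), 3) for f in F1])
    print(f"corpus BERTScore F1: {F1.mean().item():.3f}")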

    Toward automatic improvement of language produced by non-native language learners

    It is important for language learners to practice speaking and writing in realistic scenarios. The learners also need feedback on how to express themselves better in the new language. In this paper, we perform automatic paraphrase generation on language-learner texts. Our goal is to devise tools that can help language learners write more correct and natural-sounding sentences. We use a pivoting method with a character-based neural machine translation system trained on subtitle data to paraphrase and improve learner texts that contain grammatical errors and other types of noise. We perform experiments in three languages: Finnish, Swedish and English. In addition to parallel subtitle data, we experiment with monolingual data as well as error-augmented monolingual and bilingual data during training. Our results show that our baseline model, trained only on parallel bilingual data sets, is surprisingly robust to different types of noise in the source sentence, but that introducing artificial errors can improve performance. Beyond error correction, the results show promise for using the models to improve fluency and make language-learner texts more idiomatic. Peer reviewed.
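
    The pivoting method mentioned above (translate the learner sentence into a pivot language and then back) can be sketched with publicly available OPUS-MT models. This is a hedged illustration rather than the paper's system: it assumes the Hugging Face transformers package and the Helsinki-NLP/opus-mt-fi-en and Helsinki-NLP/opus-mt-en-fi checkpoints, whereas the paper uses character-based NMT models trained on subtitles.

    # Round-trip (pivot) paraphrasing sketch for a Finnish learner sentence,
    # using English as the pivot. Model names and the example sentence are
    # illustrative assumptions, not the paper's own setup.
    from transformers import MarianMTModel, MarianTokenizer

    def translate(texts, model_name):
        tokenizer = MarianTokenizer.from_pretrained(model_name)
        model = MarianMTModel.from_pretrained(model_name)
        batch = tokenizer(texts, return_tensors="pt", padding=True)
        generated = model.generate(**batch)
        return tokenizer.batch_decode(generated, skip_special_tokens=True)

    learner_sentence = ["minä menen eilen kauppaan"]  # noisy learner Finnish
    pivot = translate(learner_sentence, "Helsinki-NLP/opus-mt-fi-en")
    paraphrase = translate(pivot, "Helsinki-NLP/opus-mt-en-fi")
    print(paraphrase)  # ideally a more correct, natural-sounding sentence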

    Annotation of subtitle paraphrases using a new web tool

    This paper analyzes the manual annotation effort carried out to produce Opusparcus, the Open Subtitles Paraphrase Corpus for six European languages. Within the scope of the project, a new web-based annotation tool was created. We discuss the design choices behind the tool as well as the setup of the annotation task, and we evaluate the annotations obtained. Two independent annotators had to decide to what extent two sentences approximately meant the same thing. The sentences originate from subtitles of movies and TV shows, which constitute an interesting genre of mostly colloquial language. Inter-annotator agreement was found to be on par with a well-known earlier paraphrase resource from the news domain, the Microsoft Research Paraphrase Corpus (MSRPC). Our annotation tool is open source. The tool can be used for closed projects with restricted access and controlled user authentication as well as for open crowdsourced projects, in which anyone can participate and users are identified by IP address. Peer reviewed.
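
    Agreement figures such as the comparison with MSRPC are typically reported with a chance-corrected statistic. Below is a minimal sketch, assuming scikit-learn; the labels and the four-point scale are illustrative and not necessarily the project's actual annotation scheme.

    # Sketch: agreement between two independent annotators on paraphrase
    # candidates, measured with Cohen's kappa. Labels are invented.
    from sklearn.metrics import cohen_kappa_score

    # One label per sentence pair, e.g. 4 = "good paraphrase",
    # 3 = "mostly good", 2 = "mostly bad", 1 = "bad" (assumed scale).
    annotator_1 = [4, 3, 1, 2, 4, 4, 2, 1]
    annotator_2 = [4, 4, 1, 2, 3, 4, 2, 2]

    kappa = cohen_kappa_score(annotator_1, annotator_2)
    print(f"Cohen's kappa: {kappa:.2f}")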

    Paraphrase Detection on Noisy Subtitles in Six Languages

    We perform automatic paraphrase detection on subtitle data from the Opusparcus corpus, comprising six European languages: German, English, Finnish, French, Russian, and Swedish. We train two types of supervised sentence embedding models: a word-averaging (WA) model and a gated recurrent averaging network (GRAN) model. We find that GRAN outperforms WA and is more robust to noisy training data. Better results are obtained with more and noisier data than with less and cleaner data. Additionally, we experiment on other datasets, but do not reach the same level of performance, because of the domain mismatch between training and test data. Peer reviewed.
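
    As a rough illustration of the simpler of the two models, the word-averaging (WA) approach embeds a sentence by averaging its word vectors and scores a candidate pair by cosine similarity. The sketch below assumes PyTorch, a toy vocabulary and an untrained embedding table; the supervised training and the GRAN model's gated recurrent component are not shown.

    # Minimal word-averaging (WA) sentence embedding sketch in PyTorch.
    # Vocabulary, embedding size and inputs are toy assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class WordAveragingEncoder(nn.Module):
        def __init__(self, vocab_size, dim=100):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)

        def forward(self, token_ids):
            # token_ids: (batch, seq_len) -> average over the sequence
            return self.embed(token_ids).mean(dim=1)

    vocab = {"<unk>": 0, "let": 1, "us": 2, "go": 3, "we": 4, "should": 5, "leave": 6}
    def encode(words):
        return torch.tensor([[vocab.get(w, 0) for w in words]])

    model = WordAveragingEncoder(len(vocab))
    emb_a = model(encode(["let", "us", "go"]))
    emb_b = model(encode(["we", "should", "leave"]))

    # A pair is classified as a paraphrase if similarity exceeds a threshold.
    print(f"cosine similarity: {F.cosine_similarity(emb_a, emb_b).item():.3f}")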

    Grammatical Error Generation Based on Translated Fragments

    Peer reviewed.

    Coping with Noisy Training Data Labels in Paraphrase Detection

    Peer reviewed.